60 research outputs found

    Conditional Random Fields for Fast, Large-Scale Genome-Wide Association Studies

    Understanding the role of genetic variation in human diseases remains an important problem in genomics. An important component of such variation consists of variations at single sites in DNA, or single nucleotide polymorphisms (SNPs). Typically, the problem of associating particular SNPs with phenotypes has been confounded by hidden factors such as population structure, family structure, or cryptic relatedness in the sample of individuals being analyzed. Such confounding factors lead to a large number of spurious associations and missed associations. Various statistical methods have been proposed to account for such confounding factors, such as linear mixed-effect models (LMMs) or methods that adjust data based on a principal components analysis (PCA), but these methods either suffer from low power or cease to be tractable for larger numbers of individuals in the sample. Here we present a statistical model for conducting genome-wide association studies (GWAS) that accounts for such confounding factors. The runtime of our method scales quadratically with the number of individuals being studied, with only a modest loss in statistical power compared to LMM-based and PCA-based methods when tested on synthetic data generated from a generalized LMM. Applying our method to both real and synthetic human genotype/phenotype data, we demonstrate the ability of our model to correct for confounding factors while requiring significantly less runtime than LMMs. We have implemented methods for fitting these models, which are available at http://www.microsoft.com/science

    Greater power and computational efficiency for kernel-based association testing of sets of genetic variants

    Motivation: Set-based variance component tests have been identified as a way to increase power in association studies by aggregating weak individual effects. However, the choice of test statistic has been largely ignored even though it may play an important role in obtaining optimal power. We compared a standard statistical test, a score test, with a recently developed likelihood ratio (LR) test. Further, when correction for hidden structure is needed, or gene-gene interactions are sought, state-of-the-art algorithms for both the score and LR tests can be computationally impractical. Thus we develop new computationally efficient methods. Results: After reviewing theoretical differences in performance between the score and LR tests, we find empirically on real data that the LR test generally has more power. In particular, on 15 of 17 real datasets, the LR test yielded at least as many associations as the score test (up to 23 more associations), whereas the score test yielded at most one more association than the LR test in the two remaining datasets. On synthetic data, we find that the LR test yielded up to 12% more associations, consistent with our results on real data, but also observe a regime of extremely small signal where the score test yielded up to 25% more associations than the LR test, consistent with theory. Finally, our computational speedups now enable (i) efficient LR testing when the background kernel is full rank, and (ii) efficient score testing when the background kernel changes with each test, as for gene-gene interaction tests. The latter yielded a factor of 2000 speedup on a cohort of size 13 500. Availability: Software available at http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio/Fastlmm/. Contact: [email protected] Supplementary Information: Supplementary data are available at Bioinformatics online

    Further improvements to linear mixed models for genome-wide association studies

    We examine improvements to the linear mixed model (LMM) that better correct for population structure and family relatedness in genome-wide association studies (GWAS). LMMs rely on the estimation of a genetic similarity matrix (GSM), which encodes the pairwise similarity between every two individuals in a cohort. These similarities are estimated from single nucleotide polymorphisms (SNPs) or other genetic variants. Traditionally, all available SNPs are used to estimate the GSM. In empirical studies across a wide range of synthetic and real data, we find that modifications to this approach improve GWAS performance as measured by type I error control and power. Specifically, when only population structure is present, a GSM constructed from SNPs that well predict the phenotype, in combination with principal components as covariates, controls type I error and yields more power than the traditional LMM. In any setting, with or without population structure or family relatedness, a GSM consisting of a mixture of two component GSMs, one constructed from all SNPs and another constructed from SNPs that well predict the phenotype, again controls type I error and yields more power than the traditional LMM. Software implementing these improvements and the experimental comparisons is available at http://microsoft.com/science
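
The abstract above does not say how the GSM is computed. As a minimal sketch, assuming one widely used construction (standardize each SNP column, then average the products of the standardized values over SNPs), the pairwise similarities could be estimated as follows. The function name `genetic_similarity_matrix` and the 0/1/2 genotype coding are assumptions made for illustration, not details taken from the paper.

```python
from math import sqrt


def genetic_similarity_matrix(genotypes):
    """Estimate an n x n similarity matrix from an n x m genotype table.

    genotypes: list of n rows (individuals), each a list of m minor-allele
    counts (0, 1 or 2). This coding is an assumption for the sketch.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    standardized = []  # one standardized column per informative SNP
    for j in range(m):
        column = [row[j] for row in genotypes]
        mean = sum(column) / n
        variance = sum((g - mean) ** 2 for g in column) / n
        if variance == 0:
            continue  # a SNP with no variation carries no similarity signal
        sd = sqrt(variance)
        standardized.append([(g - mean) / sd for g in column])
    if not standardized:
        raise ValueError("no SNP varies across individuals")
    used = len(standardized)
    # K[i][k] is the average over SNPs of z[i] * z[k].
    return [
        [sum(z[i] * z[k] for z in standardized) / used for k in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    toy = [[0, 1, 2], [0, 1, 2], [2, 1, 0]]  # 3 individuals, 3 SNPs
    for row in genetic_similarity_matrix(toy):
        print([round(value, 3) for value in row])
```

Restricting the columns passed in to a chosen subset of SNPs, such as those that predict the phenotype well, changes which variants contribute to the matrix; how that subset is chosen is outside this sketch.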

    HLA-driven optimization of an HIV vaccine immunogen

    Background: HIV diversity has been driven in large part by the intense selective pressure of HLA-restricted immune responses and is a significant challenge in HIV vaccine design. Sites of HLA-associated polymorphisms indicate potential immunogenic peptides that should be incorporated into an HIV vaccine. Method: Full-length (pretreatment) HIV sequencing and high-resolution HLA-A, -B, and -C genotyping were undertaken on 245 individuals in the Western Australian HIV Cohort Study. We determined statistically significant associations between polymorphisms in HIV sequences and HLA genotypes. Given these HLA associations, we consider alternative measures of protection on the basis of the match between a viral peptide sequence and a corresponding segment of the vaccine. The measure is defined for all overlapping HIV peptides in the dataset. Each peptide contains a putative epitope and its associated flanking region. The vaccine is said to protect against a peptide sequence if the sites of HLA association in both the peptide sequence and the corresponding segment of the vaccine have nonescaped amino acids, and one of the following three criteria holds: (1) "no play": the remaining sites in the peptide sequence and corresponding segment of the vaccine match exactly, (2) "mid-play": the remaining sites in the sequence and vaccine differ only by conservative amino-acid substitutions, and (3) "full-play": the remaining sites in the sequence and vaccine need have no relationship. The three criteria represent different assumptions about the degree to which T cells cross-react. An optimal vaccine immunogen of a given length is the one that contains the largest number of (possibly overlapping) peptides that are protected against. We provide a general machine-learning approach to optimization of such immunogens. Results: We optimized vaccines of length up to 2000 aa. The predicted efficacy of the optimized vaccine immunogens depends considerably on which criterion is used. For instance, an optimized vaccine immunogen of length 1300 aa can protect against all peptides in the data under the full-play assumption, compared with 80% of all peptides under the mid-play assumption and 65% under the no-play assumption. Conclusion: These data demonstrate a novel, rational approach to optimizing the immunogenicity of an HIV vaccine against diverse circulating viruses in a human population, guided by knowledge of the population HLA
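
The three protection criteria described above can be read as a single predicate over an aligned peptide and vaccine segment. The sketch below is one possible reading, not the authors' implementation. The function names, the `escaped_at` mapping (HLA-associated site index to the set of escaped residues), and the grouping of amino acids treated as conservative substitutions are all hypothetical choices made for illustration.

```python
# Hypothetical grouping of residues treated as conservative substitutions;
# the abstract does not specify which substitutions count as conservative.
CONSERVATIVE_GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("STNQ")]


def is_conservative(a, b):
    """True if a and b are identical or fall in the same illustrative group."""
    return a == b or any(a in group and b in group for group in CONSERVATIVE_GROUPS)


def protects(peptide, segment, escaped_at, criterion):
    """Decide whether a vaccine segment protects against a peptide.

    peptide, segment: aligned amino-acid strings of equal length.
    escaped_at: dict mapping each HLA-associated site index to the set of
        residues considered escaped at that site (assumed representation).
    criterion: "no-play", "mid-play" or "full-play".
    """
    if len(peptide) != len(segment):
        raise ValueError("peptide and segment must be aligned to equal length")
    # Both sequences must carry nonescaped residues at every HLA-associated site.
    for site, escaped in escaped_at.items():
        if peptide[site] in escaped or segment[site] in escaped:
            return False
    remaining = [i for i in range(len(peptide)) if i not in escaped_at]
    if criterion == "full-play":
        return True  # remaining sites need have no relationship
    if criterion == "no-play":
        return all(peptide[i] == segment[i] for i in remaining)
    if criterion == "mid-play":
        return all(is_conservative(peptide[i], segment[i]) for i in remaining)
    raise ValueError("unknown criterion: " + criterion)


if __name__ == "__main__":
    escaped = {7: {"V"}}  # illustrative: residue V at site 7 counts as escaped
    for rule in ("no-play", "mid-play", "full-play"):
        print(rule, protects("SLYNTVATL", "SLFNTVATL", escaped, rule))
```

Under this reading, counting the protected peptides for a candidate immunogen amounts to applying `protects` to every overlapping peptide in the dataset and its corresponding vaccine segment, which is the quantity the optimization described above seeks to maximize.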

    An exhaustive epistatic SNP association analysis on expanded Wellcome Trust data

    We present an approach for genome-wide association analysis with improved power on the Wellcome Trust data consisting of seven common phenotypes and shared controls. We achieved improved power by expanding the control set to include other disease cohorts, multiple races, and closely related individuals. Within this setting, we conducted exhaustive univariate and epistatic interaction association analyses. Use of the expanded control set identified more known associations with Crohn's disease and potential new biology, including several plausible epistatic interactions in several diseases. Our work suggests that carefully combining data from large repositories could reveal many new biological insights through increased power. As a community resource, all results have been made available through an interactive web server

    Technological Advances to Address Current Issues in Entomology: 2020 Student Debates

    The 2020 Student Debates of the Entomological Society of America (ESA) were live-streamed during the Virtual Annual Meeting to debate current, prominent entomological issues of interest to members. The Student Debates Subcommittee of the National ESA Student Affairs Committee coordinated the student efforts throughout the year and hosted the live event. This year, four unbiased introductory speakers provided background for each debate topic while four multi-university teams were each assigned a debate topic under the theme ‘Technological Advances to Address Current Issues in Entomology’. The two debate topics selected were as follows: 1) What is the best taxonomic approach to identify and classify insects? and 2) What is the best current technology to address the locust swarms worldwide? Unbiased introduction speakers and debate teams began preparing approximately six months before the live event. During the live event, teams shared their critical thinking and practiced communication skills by defending their positions on either taxonomical identification and classification of insects or managing the damaging outbreaks of locusts in crops

    Learning Transcriptional Regulatory Relationships Using Sparse Graphical Models

    Understanding the organization and function of transcriptional regulatory networks by analyzing high-throughput gene expression profiles is a key problem in computational biology. The challenges in this work are 1) the lack of complete knowledge of the regulatory relationship between the regulators and the associated genes, 2) the potential for spurious associations due to confounding factors, and 3) a number of parameters to learn that is usually larger than the number of available microarray experiments. We present a sparse (L1 regularized) graphical model to address these challenges. Our model incorporates known transcription factors and introduces hidden variables to represent possible unknown transcription and confounding factors. The expression level of a gene is modeled as a linear combination of the expression levels of known transcription factors and hidden factors. Using gene expression data covering 39,296 oligonucleotide probes from 1109 human liver samples, we demonstrate that our model better predicts out-of-sample data than a model with no hidden variables. We also show that some of the gene sets associated with hidden variables are strongly correlated with Gene Ontology categories. The software including source code is available at http://grnl1.codeplex.com
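
The abstract states the model only in words. One plausible way to write it down, offered as a sketch and not as the paper's actual objective, is shown below. All symbols are assumptions introduced here: x_{gs} is the expression of gene g in sample s, t_{ks} the expression of known transcription factor k, h_{js} the value of hidden factor j, w_{gk} and v_{gj} the corresponding weights, and \lambda the strength of the L1 penalty.

```latex
% Expression of gene g in sample s as a linear combination of known and hidden factors
x_{gs} = \sum_{k=1}^{K} w_{gk}\, t_{ks} + \sum_{j=1}^{J} v_{gj}\, h_{js} + \varepsilon_{gs}

% A possible L1-regularized least-squares objective over weights and hidden factors
\min_{W,\,V,\,H} \; \sum_{g}\sum_{s} \Bigl( x_{gs} - \sum_{k=1}^{K} w_{gk}\, t_{ks} - \sum_{j=1}^{J} v_{gj}\, h_{js} \Bigr)^{2}
  + \lambda \bigl( \lVert W \rVert_{1} + \lVert V \rVert_{1} \bigr)
```

The L1 terms push most weights to exactly zero, which is how a sparse model copes with having more parameters than microarray experiments. Whether the paper penalizes both weight matrices, or treats the hidden factors differently, is not stated in the abstract.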

    Phylogenetic Dependency Networks: Inferring Patterns of CTL Escape and Codon Covariation in HIV-1 Gag

    HIV avoids elimination by cytotoxic T-lymphocytes (CTLs) through the evolution of escape mutations. Although there is mounting evidence that these escape pathways are broadly consistent among individuals with similar human leukocyte antigen (HLA) class I alleles, previous population-based studies have been limited by the inability to simultaneously account for HIV codon covariation, linkage disequilibrium among HLA alleles, and the confounding effects of HIV phylogeny when attempting to identify HLA-associated viral evolution. We have developed a statistical model of evolution, called a phylogenetic dependency network, that accounts for these three sources of confounding and identifies the primary sources of selection pressure acting on each HIV codon. Using synthetic data, we demonstrate the utility of this approach for identifying sites of HLA-mediated selection pressure and codon evolution as well as the deleterious effects of failing to account for all three sources of confounding. We then apply our approach to a large, clinically derived dataset of Gag p17 and p24 sequences from a multicenter cohort of 1144 HIV-infected individuals from British Columbia, Canada (predominantly HIV-1 clade B) and Durban, South Africa (predominantly HIV-1 clade C). The resulting phylogenetic dependency network is dense, containing 149 associations between HLA alleles and HIV codons and 1386 associations among HIV codons. These associations include the complete reconstruction of several recently defined escape and compensatory mutation pathways and agree with emerging data on patterns of epitope targeting. The phylogenetic dependency network adds to the growing body of literature suggesting that sites of escape, order of escape, and compensatory mutations are largely consistent even across different clades, although we also identify several differences between clades. As recent case studies have demonstrated, understanding both the complexity and the consistency of immune escape has important implications for CTL-based vaccine design. Phylogenetic dependency networks represent a major step toward systematically expanding our understanding of CTL escape to diverse populations and whole viral genes